scientific claim verification
MuSciClaims: Multimodal Scientific Claim Verification
Yash Kumar Lal, Manikanta Bandham, Mohammad Saqib Hasan, Apoorva Kashi, Mahnaz Koupaee, Niranjan Balasubramanian
Assessing scientific claims requires identifying, extracting, and reasoning with multimodal data expressed in information-rich figures in scientific literature. Despite the large body of work in scientific QA, figure captioning, and other multimodal reasoning tasks over chart-based data, there are no readily usable multimodal benchmarks that directly test claim verification abilities. To remedy this gap, we introduce a new benchmark, MuSciClaims, accompanied by diagnostic tasks. We automatically extract supported claims from scientific articles, which we manually perturb to produce contradicted claims. The perturbations are designed to test for a specific set of claim verification capabilities. We also introduce a suite of diagnostic tasks that help understand model failures. Our results show most vision-language models perform poorly (~0.3-0.5 F1), with even the best model only achieving 0.72 F1. They are also biased towards judging claims as supported, likely misunderstanding nuanced perturbations within the claims. Our diagnostics show that models are poor at localizing correct evidence within figures, struggle to aggregate information across modalities, and often fail to understand basic components of the figure.
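As a concrete illustration of how a benchmark like this might be scored, the sketch below runs a stand-in vision-language classifier over (claim, figure) pairs and computes macro-F1 over support/contradict labels. The `Example` schema, the `classify_claim` stub, and the label names are hypothetical placeholders, not MuSciClaims' actual interface.

```python
# Minimal sketch of a figure-based claim-verification evaluation loop.
# Dataset fields, the classify_claim stub, and the label set are assumptions.
from dataclasses import dataclass
from typing import List

from sklearn.metrics import f1_score

LABELS = ["SUPPORT", "CONTRADICT"]  # assumed binary label set


@dataclass
class Example:
    claim: str
    figure_path: str  # figure image the claim is evaluated against
    gold_label: str   # "SUPPORT" or "CONTRADICT"


def classify_claim(claim: str, figure_path: str) -> str:
    """Stub: replace with a call to the vision-language model under test.
    Must return one of the labels in LABELS."""
    return "SUPPORT"  # trivially support-biased, mirroring the failure mode above


def evaluate(examples: List[Example]) -> float:
    gold = [ex.gold_label for ex in examples]
    pred = [classify_claim(ex.claim, ex.figure_path) for ex in examples]
    return f1_score(gold, pred, labels=LABELS, average="macro")


if __name__ == "__main__":
    demo = [
        Example("Condition A outperforms B in panel (c).", "fig3.png", "SUPPORT"),
        Example("Condition B outperforms A in panel (c).", "fig3.png", "CONTRADICT"),
    ]
    print(f"macro-F1: {evaluate(demo):.2f}")
```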
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification
Chengye Wang, Yifei Shen, Zexi Kuang, Arman Cohan, Yilun Zhao
We introduce SciVer, the first benchmark specifically designed to evaluate the ability of foundation models to verify claims within a multimodal scientific context. SciVer consists of 3,000 expert-annotated examples over 1,113 scientific papers, covering four subsets, each representing a common reasoning type in multimodal scientific claim verification. To enable fine-grained evaluation, each example includes expert-annotated supporting evidence. We assess the performance of 21 state-of-the-art multimodal foundation models, including o4-mini, Gemini-2.5-Flash, Llama-3.2-Vision, and Qwen2.5-VL. Our experiments reveal a substantial performance gap between these models and human experts on SciVer. Through an in-depth analysis of retrieval-augmented generation (RAG) and human-conducted error evaluations, we identify critical limitations in current open-source models, offering key insights to advance models' comprehension and reasoning in multimodal scientific literature tasks.
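One way to picture the RAG analysis mentioned above: before verification, rank a paper's text, table, and figure-caption chunks against the claim and keep only the top-k as context for the verifier. The sketch below uses a simple TF-IDF retriever as a stand-in; the chunking scheme and the `retrieve_context` helper are illustrative assumptions, not SciVer's actual pipeline.

```python
# Minimal sketch of the retrieval step in a RAG-style verification setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def retrieve_context(claim: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank candidate context chunks by TF-IDF similarity to the claim."""
    vectorizer = TfidfVectorizer().fit(chunks + [claim])
    scores = cosine_similarity(
        vectorizer.transform([claim]), vectorizer.transform(chunks)
    ).ravel()
    top = scores.argsort()[::-1][:k]
    return [chunks[i] for i in top]


if __name__ == "__main__":
    chunks = [
        "Table 2: accuracy of all models on the development split.",
        "Figure 1 caption: training loss over epochs.",
        "Section 5: our model outperforms the baseline by 4 points.",
    ]
    print(retrieve_context("The proposed model beats the baseline.", chunks, k=2))
```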
SciClaimHunt: A Large Dataset for Evidence-based Scientific Claim Verification
Sujit Kumar, Anshul Sharma, Siddharth Hemant Khincha, Gargi Shroff, Sanasam Ranbir Singh, Rahul Mishra
Verifying scientific claims presents a significantly greater challenge than verifying political or news-related claims. Unlike the relatively broad audience for political claims, the users of scientific claim verification systems can vary widely, ranging from researchers testing specific hypotheses to everyday users seeking information on a medication. Additionally, the evidence for scientific claims is often highly complex, involving technical terminology and intricate domain-specific concepts that require specialized models for accurate verification. Despite considerable interest from the research community, there is a noticeable lack of large-scale scientific claim verification datasets to benchmark and train effective models. To bridge this gap, we introduce two large-scale datasets, SciClaimHunt and SciClaimHunt_Num, derived from scientific research papers. We propose several baseline models tailored for scientific claim verification to assess the effectiveness of these datasets. Additionally, we evaluate models trained on SciClaimHunt and SciClaimHunt_Num against existing scientific claim verification datasets to gauge their quality and reliability. Furthermore, we conduct human evaluations of the claims in the proposed datasets and perform error analysis to assess the effectiveness of the proposed baseline models. Our findings indicate that SciClaimHunt and SciClaimHunt_Num serve as highly reliable resources for training models in scientific claim verification.
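For a sense of what a text-only baseline verifier might look like, the sketch below frames claim verification as natural language inference with an off-the-shelf MNLI cross-encoder: the evidence passage is the premise and the claim is the hypothesis. This is a plausible stand-in, not one of the paper's proposed baselines, and the model choice is an assumption.

```python
# Minimal NLI-style verification baseline sketch (assumed, not the paper's).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # illustrative off-the-shelf choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def verify(evidence: str, claim: str) -> str:
    """Treat the evidence as the premise and the claim as the hypothesis."""
    inputs = tokenizer(evidence, claim, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))]


if __name__ == "__main__":
    evidence = "Patients receiving the drug showed a 30% reduction in symptoms."
    print(verify(evidence, "The drug reduces symptoms."))           # expect ENTAILMENT
    print(verify(evidence, "The drug has no effect on symptoms."))  # expect CONTRADICTION
```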
SciFact-Open: Towards open-domain scientific claim verification
David Wadden, Kyle Lo, Bailey Kuehl, Arman Cohan, Iz Beltagy, Lucy Lu Wang, Hannaneh Hajishirzi
While research on scientific claim verification has led to the development of powerful systems that appear to approach human performance, these approaches have yet to be tested in a realistic setting against large corpora of scientific literature. Moving to this open-domain evaluation setting, however, poses unique challenges; in particular, it is infeasible to exhaustively annotate all evidence documents. In this work, we present SciFact-Open, a new test collection designed to evaluate the performance of scientific claim verification systems on a corpus of 500K research abstracts. Drawing upon pooling techniques from information retrieval, we collect evidence for scientific claims by pooling and annotating the top predictions of four state-of-the-art scientific claim verification models. We find that systems developed on smaller corpora struggle to generalize to SciFact-Open, exhibiting performance drops of at least 15 F1. In addition, analysis of the evidence in SciFact-Open reveals interesting phenomena likely to appear when claim verification systems are deployed in practice, e.g., cases where the evidence supports only a special case of the claim. Our dataset is available at https://github.com/dwadden/scifact-open.
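The pooling idea borrowed from information retrieval can be made concrete in a few lines: for each claim, take the union of the top-k evidence documents predicted by several systems and send that pool to annotators. The system names, the value of k, and the `build_pool` helper below are illustrative; the paper's exact pooling configuration may differ.

```python
# Minimal sketch of IR-style pooling for evidence annotation.
def build_pool(ranked_runs: dict[str, list[str]], k: int = 10) -> list[str]:
    """ranked_runs maps a system name to its ranked doc IDs for one claim;
    the returned pool is the deduplicated union of each system's top-k."""
    pool: list[str] = []
    seen: set[str] = set()
    for system, ranking in ranked_runs.items():
        for doc_id in ranking[:k]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    return pool


if __name__ == "__main__":
    runs = {
        "system_a": ["d12", "d7", "d3"],
        "system_b": ["d7", "d99", "d12"],
    }
    print(build_pool(runs, k=2))  # ['d12', 'd7', 'd99']
```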
LongChecker: Improving scientific claim verification by modeling full-abstract context
David Wadden, Kyle Lo, Lucy Lu Wang, Arman Cohan, Iz Beltagy, Hannaneh Hajishirzi
We introduce the LongChecker system for scientific claim verification. Given a scientific claim and an evidence-containing research abstract, LongChecker predicts a veracity label and identifies supporting rationales in a multitask fashion based on a shared encoding of the claim and abstract. We perform experiments on the SciFact dataset, and find that LongChecker achieves state-of-the-art performance. We conduct analysis to understand the source of this improvement, and find that identifying the relationship between a claim and a rationale reporting a scientific finding often requires understanding the context in which the rationale appears. By making labeling decisions based on all available context, LongChecker achieves better performance on cases requiring this type of understanding. In addition, we show that LongChecker is able to leverage weakly-supervised in-domain data to facilitate few-shot domain adaptation for scientific claim verification.
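A rough picture of the multitask setup described here: one shared encoding of the claim concatenated with the abstract, a veracity head reading the sequence-start token, and a rationale head scoring the token that opens each abstract sentence. The sketch below is an assumption-laden re-creation (model choice, label set, head shapes), not the authors' released implementation.

```python
# Minimal sketch of a LongChecker-style multitask verifier (assumed details).
import torch
import torch.nn as nn
from transformers import AutoModel


class MultitaskVerifier(nn.Module):
    """Shared encoder; one head for the veracity label, one for per-sentence rationales."""

    def __init__(self, model_name: str = "allenai/longformer-base-4096", num_labels: int = 3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.label_head = nn.Linear(hidden, num_labels)  # e.g. SUPPORT / REFUTE / NEI
        self.rationale_head = nn.Linear(hidden, 1)       # rationale vs. non-rationale

    def forward(self, input_ids, attention_mask, sentence_starts):
        # input_ids encodes "claim </s> abstract"; sentence_starts holds the token
        # index at which each abstract sentence begins (shape: [num_sentences]).
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = out.last_hidden_state                  # (batch=1, seq_len, hidden)
        label_logits = self.label_head(hidden[:, 0])    # read the sequence-start token
        sentence_reprs = hidden[0, sentence_starts]     # one vector per abstract sentence
        rationale_logits = self.rationale_head(sentence_reprs).squeeze(-1)
        return label_logits, rationale_logits
```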